To run this script for the ABE data: Rscript run_analysis_individual.R 2 markers_ABE sce_ABE.rds analysis_ABE.html
To run this script for the CBE data: Rscript run_analysis_individual.R 1 markers_CBE sce_CBE.rds analysis_CBE.html
run_analysis_individual.R:
params=commandArgs(trailingOnly=TRUE) rmarkdown::render(“analysis_individual.Rmd”,output_file=params[4], params=list(dataset_ind=params[1],save_file_name_markers=params[2],sce_save_file=params[3]))
knitr::opts_chunk$set(warning=FALSE,message=FALSE)
# output:
# html_document:
# number_sections: yes
# toc: yes
# keep_md: yes
library(bluster)
source("../../core_functions.R")
library("batchelor")
library("scater")
library(scran)
library(dplyr)
library(igraph)
library(RColorBrewer)
library(pheatmap)
library(Seurat)
library(destiny)
library(edgeR)
library(MASS)
library(energy)
library(ggthemes)
library(progeny)
#library(ggridges)
library(stringr)
library(ggrepel)
library(MHTmult)
set.seed(4444)
knitr::opts_chunk$set(
echo = FALSE,
message = FALSE,
warning = FALSE,
message = FALSE,
# dev = c("pdf"),
dpi=300
)
#params=c(2 ,"markers_ABE", "sce_ABE.rds", "analysis_ABE.html")
dataset_ind <- as.double(params[1])
sce_list_file <- "sce_list_gRNA_iBAR.rds"
save_file_name_markers <- params[2]
sce_save_file <- params[3]
folder_cellranger <- "../cellranger/cellranger700_count_46539_7105STDY13259924_GRCh38-2020-A"
yy <- "CBE"
if (dataset_ind==2){yy <- "ABE"}
cell_cycle_frequency_save_file <- paste0("cell_cycle_",yy,".csv")
cell_cycle_frequency_save_file2 <- paste0("cell_cycle_target_gene",yy,".csv")
proliferation_file <- paste0("proliferation",yy,".csv")
markers_file_all <- paste0("markers",yy,"_all.csv") #DE of all genes, also
#' at non-significant levels or if 0
markers_file <- paste0("markers",yy,".csv") #significant DE expression
file_gRNA_selection <- paste0("selected_gRNAs",yy,".csv")
file_energy_distance <- paste0("ed_",yy,".rds")
save_file_name_validation_data <- paste0("validation_lib_ann_",yy,".csv")
colourscheme_variant_class <- c("control"="lightgrey","canonical drug resistance"="darkblue","driver"="orange","drug addiction"="darkred")
We read in the SingleCellExperiment object, for which QC, gRNA and iBAR calling had been performed previously, and add meta-data from the pooled experiments.
We plot the number of gRNAs called in a cell that passed QC. If there are several gRNAs in a cell, then each of them is counted. The figure below shows that cells with gRNAs targeting splice-essential sites proliferate much less than the other groups. The gRNAs targeting splice-essential sites were added as a positive control to check their depletion, and will be removed from downstream analysis.
After the removal of the splice essential gRNAs, we have the following number of cells: 15496.
The following is a histogram of number of cells per gRNA, with the splice essential gRNAs excluded.
The following figure is a histogram of the number of different iBARs per gRNA.
The mean number of cells per iBAR-group is equal to 1.6302998.
We combine cells with the same gRNA-iBAR combination by averaging their gene expression as follows: we combine the counts and then normalise for library size using the logNormCounts function from the scuttle Bioconductor package (McCarthy et al. (2017)). We use the iBAR-gRNA combinations rather than iBARs. As the iBAR is only 6nt long, some identical iBAR sequence may have been inserted in multiple cells, whereas in combination with the gRNA, the iBAR creates a unique identifier for a small group of cells with the same genoytpe. Aggregating over groups of cells with the same iBAR and gRNA eliminates the problem of biases resulting from unequal numbers of descendant cells for different progenitor cells. We call a group of cells with the same iBAR-gRNA combination, and therefore the same genotype, an iBAR-group.
Number of different iBAR-groups: 9505
Computing the average number of iBAR-groups per gRNA including cells with multiple gRNAs:
## [1] 56.40553
Average number of iBAR-groups per gRNA (including cells with multiple gRNAs): 56.40553.
For each gRNA, we compute the number of iBAR-groups.
For iBAR-groups with several gRNAs, variant classes are ranked by their impact as follows: driver/drug addiction > canonical drug resistance > control. iBAR groups with both driver and drug addiction conferring gRNAs are discarded. Colours indicate the class with the highest impact for each cell.
We compute scores for pathway activity for each iBAR-group.
First we compute PROGENy scores, which quantify pathway activation as described in (Schubert?) and Holland et al. (2020).
We transform the PROGENy scores linearly such that their mean across the non-targeting control gRNAs is 0 and the standard deviation across the non-targeting control gRNAs is equal to 1.
First, we compute MAYA activity scores (Landais and Vallot (2023)). MAYA computes for each gene set a linear low dimensional representation of gene expression using principal component analysis. For each gene set and each principal component it computes a score for each iBAR-group in the data set. The method retains those scores found to be bimodal across the data set.
To test and adjust p-values across both the transcriptome and pathways, we first construct a SingleCellExperiment containing expression data for both the genes and the pathways scores.
We identify those gRNAs that are associated with drug resistance or drug addiction, driver mutations, or mutations to BCL2 and EGFR with potential high impact, and that are the only gRNA for at least 10 iBAR-groups.
The number of such gRNAs is equal to 20.
We compute differential expression across all genes as well as the PROGENy and MAYA pathway scores. First, we select those gRNAs with at least one gene or pathway with a p-value less than 10^-6. Following this selection, we control the expected average false discovery rate for DE testing for the selected gRNAs as in Benjamini and Bogomolov (2013).
We also save the markers for gene expression and pathway scores separately.
We compute energy distances (Replogle et al. (2022)), between targeting and non-targeting gRNAs, using iBAR-groups instead of cells.
We first reduce the dimension of our data set by using principal components in a way guided by the specific application, based on a basis of genes consisting of marker genes from both the CBE and ABE data sets with a minimum log2-fold-change of 0.25 (the same for both of the data sets).
Now we compute the energy distances based on the principal components computed above. We compute the energy distance between a targeting gRNA and the non-targeting gRNAs as the median of the energy distances between the targeting gRNA and each individual non-targeting gRNA with at least 10 iBAR groups.
We normalise the energy distances by dividing them by the median of the energy distances found across different non-targeting gRNAs for the respective data set.
We add the energy distances to the gRNA meta-data.
The following is a plot of the distribution of energy distance coloured by variant class.
The following plot is coloured by whether the gRNA confers resistance or not (binary). Resistance includes drivers and drug addiction.
For the differential expression analysis, we had selected those gRNAs with at least one gene or pathway score differentially expressed with a p-value lower than 10^-6. We then corrected the p-values from differential gene expression testing for the selected gRNAs to keep the expected average false discovery rate (FDR) across gRNAs below 0.1. The following gRNAs were selected by the p-value cutoff of 10^-6.
The number of selected gRNAs is equal to 13.
The percentage of selected among tested gRNAs (all gRNAs with at least 10 iBAR groups and which were categorised as conferring canonical resistance or drug addiction, or as driver mutations) is equal to 65%.
Across the selected gRNAs, we compute the number of differentially expressed genes (average expected FDR < 0.1):
The numbers of significantly differentially expressed genes are as follows:
## BRAF_5802f4 BRAF_82dc01 KRAS_2935b9 KRAS_71f09c MAP2K1_01c79e
## 28 676 2358 2624 656
## MAP2K1_273ccb MAP2K1_4802a9 MAP2K1_89a917 MAP2K1_a2b2c6 MAP2K1_ae4498
## 33 902 670 781 492
## MAP2K1_fd2674 MAP2K2_4d62ce MAP2K2_fdb5ed
## 312 198 635
As the level of significance is obviously influenced by the numbers of cells for the gRNA, we compare numbers of differentially expressed genes to cell numbers:
We also compare the energy distances to non-targeting control gRNAs for gRNAs impacting the transcriptome versus those not impacting the transcriptome. This shows that gRNAs with a significant transcriptional impact may still have energy distances to the control gRNAs that are similar to those for gRNAs that did not pass our threshold for significant transcriptional impact, especially if only a limited number of genes are differentially expressed.
Now we illustrate how the energy distances are related to the number of significantly differentially expressed genes.
We add a column to the validation_lib meta data with binary values indicating whether the gRNA led to significant transcriptional impact as defined by our threshold (at least one p-val < 10^-6).
How many gRNAs from each class (canoncial resistance, driver, drug addiction) have a significant impact as defined by our cutoff?
In this section we visualise log2-fold changes between impacted and control gRNAs.
The following heatmap includes genes that are statistically significantly (at FDR 0.1) differentially expressed with log2-fold-change > 0.5 for at least 1 gRNA. Using hierarchical clustering, we identify two groups of genes that are distinct in terms of their patterns of up- or downregulation across gRNAs. We also cluster the gRNAs.
We add the cluster information to the gRNA meta-data.
Finally, we replot the heatmap with the same order of gRNAs, only plotting cell-cycle related genes, and restricting to gRNAs found to be resistant in the pooled screens and having a significant overall transcriptional effect. The heatmap includes only strongly affected cell cycle related genes, where we used the following cutoff for inclusion in the figure: an absolute log2-fold change of at least abs(lfc) > 0.75 and FDR < 0.001 for at least one gRNA.
We plot the heatmap of PROGENy scores for all gRNAs in the same order as for the gene expression heatmap.
The following heatmap illustrates the log2-fold-changes compared to nontargeting gRNAs for the PROGENy pathway scores. The plot includes all gRNAs tested for DE (i.e. gRNAs that were not found to act like controls in pooled screens and had at least 10 iBAR groups assigned to them), even those without differentially expressed genes or pathways.
Below is the corresponding plot for the MAYA pathway scores.
We perform PCA for further downstream analysis and visualisation (energy distance, UMAP, diffusion score).
We use the genes identified above as differentially expressed between a gRNA and non-targeting controls as a basis of genes on which to perform principal component analysis (PCA). Genes are included if for at least one gRNA they have an FDR<0.1 and an abs(lfc) > 0.25. Unlike for the energy distances, for which we used a shared basis of genes for the CBE and ABE data sets for better comparability, we now use a basis of genes identified using the ABE data set only. We use these genes to compute UMAP and diffusion map representations.
We now plot the UMAP representation of the data. Each iBAR-group is a dot. For iBAR-groups with several gRNAs, variant classes are ranked by their impact as follows: driver/drug addiction > canonical drug resistance > control. iBAR groups with both driver and drug addiction conferring gRNAs are discarded. Colours indicate the class with the highest impact for each cell.
We plot the correlation of differential expression from non-targeting controls (measured by the log-fold change) across all classes of resistance targets.
We restrict the plot to drivers/drug addiction
We use DE expression data (log2-fold change) for patients with PFS > 6 months and patients with PFS < 6 months from the (Tian?). First, we select the genes with adjusted p-values < 0.1 and abs(log2-fold change) > 0.25 for differential expression at 15 days versus pre-treatment for PFS<6m or PFS>6m.
We compute rank correlation between our differential expression results and Tian et al.’s DE results for patients with progression free survival (PFS) < 6 months, based on the genes selected above.
We compare PFS outcome scores across the different variant classes.
We test for significance of difference of PSF outcome scores across variant classes.
We add the PFS_outcome score to the SingleCellExperiment, for iBAR clones with one resistance gRNA and no other gRNAs.
We compute diffusion scores as in Haghverdi, Buettner, and Theis (2015) and Cooper et al. (2024), for those resistance-conferring gRNAs (assessment of resistance in the pooled screens) that are the unique gRNA in at least 10 iBAR groups.
The following is a diffusion map coloured by the diffusion score.
The following plot shows the distribution of diffusion scores for each driver/drug addiction gRNA, ordered horizontally by their mean diffusion scores. For several of the gRNAs the diffusion scores have a multi-modal distribution, indicating either differences in editing between iBAR-groups of the same gRNA, or different responses to the same edits.
Now we re-colour the plot above by PFS_outcome score, keeping the horizontal ordering of the gRNAs as above.
The following scatterplot compares average diffusion scores for each gRNA to survival scores, illustrating the high negative correlation between the two scores.
Correlation between diffusion and PFS_outcome score: -0.6785292.
We plot up to 20 upregulated and 20 downregulated genes (the most statistically significant ones), using the markers computed previously for all the gRNAs passing our threshold for transcriptional impact (p-value < 10^-6 for at least one gene or pathway).
Now we perform differential expression for PROGENy scores across the variant types. To give equal weight to each gRNA, we subsample the same number of iBAR groups for each gRNA for any gRNA with more than 20 iBAR groups.
We plot the same figure for the PROGENy pathway scores.
We investigate whether there is a difference in proportion of cell cycle phases across variant types. First, we plot the relation between cell cycle and variant class.
## # A tibble: 9 × 3
## # Groups: variant_class, phase [9]
## variant_class phase n
## <chr> <chr> <int>
## 1 canonical drug resistance G1 240
## 2 canonical drug resistance G2M 63
## 3 canonical drug resistance S 78
## 4 control G1 6233
## 5 control G2M 1363
## 6 control S 938
## 7 drug addiction G1 279
## 8 drug addiction G2M 122
## 9 drug addiction S 189
Now we plot the relation between cell cycle phase and resistance as a binary variable.
We test whether there is a significant change in the number of iBAR groups for each cell cycle phase between gRNAs associated with resistance, and those not associated with resistance.
##
## Pearson's Chi-squared test
##
## data: table_compare_cc
## X-squared = 262.17, df = 2, p-value < 2.2e-16
We add the average diffusion score (for drug addiction(ABE)/driver(CBE)) and PSF outcome score to the validation library, and saving the annotated validation library and the SingleCellExperiment.